Altus Safety Automation · PoC architecture

How it's put together — and why

A walkthrough of the system that takes a signed-off site audit, has Claude cross-reference it against the job's retest sheet, drafts the failure report and audit summary, and lands the finished documents in SharePoint — with a human approval gate before anything leaves the building.

For non-technical stakeholders · 2026-05-19

§ 01 The cloud — Azure · where it all runs

Everything runs in one Azure resource group in the UK South region. The choices favour managed, serverless, pay-as-you-go services so the PoC's fixed monthly baseline stays small: if no audits arrive in a day, the Python services cost effectively nothing, and only the always-on pieces (Postgres and the n8n container) keep billing.

Compute

Container Apps

Every container in the system runs here. Python services scale to zero when idle — n8n stays warm (one replica, always on) because it owns the cron triggers and webhook endpoints.

Why: right-sized for the workload — bursty Python services pay only when working, while the orchestrator is always listening.

Database

Postgres Flexible Server

Smallest tier (B1ms). Holds audit-trail rows, workflow state, replay cache. 7-day point-in-time-recovery.

Why: managed Postgres — no patching, no backups to script.

Image registry

Container Registry

One private registry holds every service image. Pulls authenticated by managed identity.

Why: integrated with Container Apps; no Docker Hub rate-limit pain.

Secrets

Key Vault

API keys (Anthropic, Microsoft Graph), database connection string, JWT signing keys. Never in source code or env files.

Why: rotation, audit, RBAC — standard for any production-bound posture.

Observability

Application Insights

Receives OpenTelemetry traces from every service. Backs the "Altus agent runs" Workbook and Teams alert rules.

Why: Microsoft-native, no separate vendor, queries via KQL.

Identity

Entra ID + UAMIs

Each service has its own user-assigned managed identity (UAMI). Database, Key Vault, and Container Registry access are token-based: no shared passwords.

Why: passwordless service-to-service auth; revoke an identity, you revoke its access.

Networking

Container Apps environment + NAT

All services share one VNet. Outbound traffic egresses through a NAT gateway with a known IP — useful for vendor IP allow-lists.

Why: single security perimeter; predictable egress.

Infrastructure-as-code

Bicep

Every Azure resource is declared in Bicep files in the repo. Provisioning is a single command. No click-ops.

Why: reproducible, reviewable, Microsoft-native (no Terraform vendor).

Storage & doc store

SharePoint Online

Not strictly Azure compute, but the same Microsoft tenant. Holds the client-folder structure and the coordinator approval list.

Why: coordinators already live in M365 — no new app to learn.

Likely running cost — for the PoC volume (tens of audits per week, not hundreds): ~£60–£100/month on Azure, plus Anthropic API spend governed by the £20/day cap (typically £50–£200/month in practice). Largest line items: Postgres (~£25/mo), n8n always-on container (~£15/mo), Python services on-demand compute (~£5–£15/mo combined), Application Insights ingestion (~£5–£15/mo). ACR, Key Vault, Storage round to a couple of pounds each. Azure costs are paid directly by Altus on their subscription — this engagement covers the build, not the running tab.

§ 02 At a glance · today, in staging

The numbers behind the slide deck. This is one stack, three small services, one skill, one approval surface — deliberately small so a single technical person can build it, hand it over, and walk away from it.

AI runner

agent-svc-skilled — Anthropic Platform, hosted skill

Skills in library

Audit-review v1, hosted by Anthropic, source in git

Stage 3.5

Live

Cache + per-iteration observability merged, smoke green

Document templates

Audit summary & failure report, both editable Word files

Independent services

agent-svc-skilled, docgen-svc, fallback-svc (+ approval-app)

Workflow surface

n8n

Low-code, every box readable by a power user

Cost cap

£20/day

Circuit-breaks all paths if breached

Cloud surface

Azure

UK South, Container Apps, Postgres, SharePoint, App Insights

§ 03 The story · why this exists

An Altus safety engineer signs off an audit on iAuditor. Today an offshore admin team copy-pastes that audit against the job's retest sheet, classifies each asset, writes the failure-report narrative for anything flagged, and produces two Word documents. It is slow, inconsistent, and the source of most quality issues that reach the client. This system removes the offshore step, keeps the human approval, and aims for clean documents in minutes.

The trigger

An audit is signed off

An audit becomes available in iAuditor (SafetyCulture) once a site visit is closed and signed off. The system polls for new ones every 15 minutes; a SharePoint drop-folder is the fallback if the API is unavailable.

The reasoning

Cross-reference & draft

Claude reads the audit and the retest summary sheet, matches assets, classifies pass / pass-with-recommendation / fail / not-tested, and drafts narrative for every flagged item — with verbatim citations back to the audit.
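
As a rough illustration of the structured result this step produces, the sketch below models one classified asset in Python. The four outcome values come from the description above; the field names (`asset_id`, `citation`, `narrative`), the sample assets, and the reading of "flagged" as "anything other than a clean pass" are assumptions made for the example, not the project's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    PASS_WITH_RECOMMENDATION = "pass-with-recommendation"
    FAIL = "fail"
    NOT_TESTED = "not-tested"


@dataclass
class AssetFinding:
    asset_id: str        # hypothetical key shared by the audit and the retest sheet
    outcome: Outcome
    citation: str = ""   # verbatim quote from the audit backing the classification
    narrative: str = ""  # drafted text, expected only for flagged items


def is_flagged(finding: AssetFinding) -> bool:
    # Assumption: every outcome except a clean pass counts as flagged.
    return finding.outcome is not Outcome.PASS


findings = [
    AssetFinding("ASSET-001", Outcome.PASS),
    AssetFinding("ASSET-002", Outcome.FAIL, citation="(quote from audit)"),
    AssetFinding("ASSET-003", Outcome.NOT_TESTED),
]
flagged_ids = [f.asset_id for f in findings if is_flagged(f)]
print(flagged_ids)  # ['ASSET-002', 'ASSET-003']
```

A fixed set of outcome values is what lets the document step stay a pure template merge: it only ever has to handle four known cases.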

The output

Two Word docs, gated

Audit Summary + Failure Report land in the client's SharePoint folder. A coordinator opens them in Word, edits if needed, and approves in a SharePoint list. Nothing leaves Altus without a name on it.

§ 04 The shape of the system · topology

Three small services, one orchestrator, one source of truth for skills, one approval surface. Connections are deliberately few. Every solid line is a place where a power user can see, or change, behaviour.

Boxes inside grouped frames live inside that Azure / M365 surface. Thick arrows = main flow. Dotted arrows = secrets, traces, alerts, CI sync.

Diagram legend: n8n orchestrator · Azure Container Apps · Azure managed data (Postgres, KV) · Azure observability (App Insights) · Microsoft 365 tenant · External SaaS (Anthropic) · External trigger · Human

§ 05 What lives where · components, in plain terms

Each box on the diagram is a deliberate choice about where work happens and who can touch it. The principle: keep AI in the places that need reasoning, keep everything else mechanical and editable.

Orchestrator

n8n — the low-code conductor

Every workflow — "when an audit is signed off, do X then Y then Z" — lives here as a visual graph. The ops lead can open it in a browser, read the boxes, change the schedule, edit a query.

Polls iAuditor, fans out HTTP calls, branches on conditions
Owns the SharePoint approval poll
Hosts a heartbeat workflow that pings every service every 5 min

Why low-code? A handover-friendly tier: change behaviour without writing Python.

Reasoning

agent-svc-skilled — the AI runner

A small Python service. Takes the audit + retest sheet, asks Claude (using the hosted skill) to cross-reference and draft. Streams the structured JSON back to n8n.

Calls Anthropic /v1/messages with the hosted skill ID
Logs every call (cost, tokens, cache hits, iterations) to Postgres
Cost cap, circuit breaker, replay cache — all enforced server-side
Single, focused service — one job, done well
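
The daily cost cap and circuit breaker listed above can be pictured with a minimal Python sketch. It assumes a simple in-memory ledger: the class name `DailyCostCap`, its methods, and the per-call costs are hypothetical, and the sketch treats "reached or exceeded" as breached. The £20 figure is the cap stated in this document; the real service keeps its ledger in Postgres.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal


class CostCapBreached(Exception):
    """Raised when the day's spend has already reached the cap."""


@dataclass
class DailyCostCap:
    # Hypothetical in-memory stand-in for the Postgres cost ledger.
    cap_gbp: Decimal = Decimal("20.00")
    spend_by_day: dict = field(default_factory=dict)

    def spent(self, day: date) -> Decimal:
        return self.spend_by_day.get(day, Decimal("0"))

    def check(self, day: date) -> None:
        """Refuse new work once the day's spend has reached the cap."""
        if self.spent(day) >= self.cap_gbp:
            raise CostCapBreached(f"{day}: spent £{self.spent(day)} of £{self.cap_gbp}")

    def record(self, day: date, cost_gbp: Decimal) -> None:
        """Add the cost of a completed call to the day's total."""
        self.spend_by_day[day] = self.spent(day) + cost_gbp


cap = DailyCostCap()
today = date(2026, 5, 19)

cap.check(today)                     # nothing spent yet, so the call may proceed
cap.record(today, Decimal("19.50"))
cap.check(today)                     # £19.50 is still under the cap
cap.record(today, Decimal("0.75"))   # total is now £20.25

try:
    cap.check(today)
    blocked = False
except CostCapBreached:
    blocked = True
print(blocked)  # True
```

Checking before each call, rather than after, means a single call can overshoot the cap slightly but no further call starts once it has been reached.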

Doc generation

docgen-svc — Word + SharePoint

Takes the structured JSON from the agent and merges it into Word templates (audit_summary, failure_report). Uploads via Microsoft Graph to the right client folder.

No AI — pure template merge
Templates are editable Word documents in git
Same service will produce RAMS + reminder emails in Phase 2

Fallback

fallback-svc — the manual path

If iAuditor is down or a one-off PDF arrives, an admin drops the file into a SharePoint folder. This service watches that folder and feeds the same pipeline.

Hash-based de-duplication
Parses the retest summary sheet via openpyxl
Quarterly drill keeps it warm
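
Hash-based de-duplication can be sketched in a few lines of Python. This is an illustration under assumptions, not the service's code: the choice of SHA-256, the `DropFolderDeduper` class, and the in-memory set are all hypothetical, and a real service would persist the fingerprints.

```python
import hashlib


def file_fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


class DropFolderDeduper:
    """Remembers fingerprints already processed (in memory, for illustration)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_new(self, data: bytes) -> bool:
        """True the first time given contents are seen, False on repeats."""
        digest = file_fingerprint(data)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


deduper = DropFolderDeduper()
first = deduper.is_new(b"audit export, job 1234")
repeat = deduper.is_new(b"audit export, job 1234")  # same bytes dropped again
other = deduper.is_new(b"audit export, job 5678")
print(first, repeat, other)  # True False True
```

Fingerprinting the contents rather than the filename means a file that is renamed and dropped twice is still processed only once.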

Skill library

altus-safety-skills — the git repo

The source of truth for every skill. Plain markdown (SKILL.md) + JSON schema. A second repo — deliberately separate from the platform code so a power user can edit a skill without touching infrastructure.

CI auto-uploads to Anthropic on every merge
Every PR runs the 6-fixture eval against the real Anthropic API
Power users edit markdown, not Python

Approval gate

approval-app — coordinator view

A small web app (and a SharePoint list, today) that surfaces every AI draft to a coordinator. They open the Word doc, edit if needed, tick approved.

Read-only audit-trail page at /obs/reasoning-calls
The system never sends a client document unapproved
Coordinators stay in M365 where they already live

State

Postgres — audit trail & cache

One small database, three load-bearing tables: workflow_runs, audits, reasoning_calls. Every Claude call is recorded in full, so any decision can be explained to a safety regulator.

Replay cache — same input ⇒ same output, no second API call
Cost ledger by day, per path
Iteration-level traces from Stage 3.5
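
The replay-cache idea (same input, same output, no second API call) can be sketched as follows. The key derivation here is an assumption: it hashes a canonical JSON form of the inputs together with a skill version, so a changed skill is not served stale results. `ReplayCache`, `cache_key`, and `fake_model` are hypothetical names, and the dictionary stands in for the Postgres table.

```python
import hashlib
import json


def cache_key(skill_version: str, audit: dict, retest_sheet: dict) -> str:
    """Hash a canonical JSON form so equal inputs always give the same key."""
    canonical = json.dumps(
        {"skill_version": skill_version, "audit": audit, "retest_sheet": retest_sheet},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ReplayCache:
    def __init__(self) -> None:
        self._store: dict[str, dict] = {}
        self.api_calls = 0  # counts how often the model was actually called

    def run(self, skill_version: str, audit: dict, retest_sheet: dict, call_model) -> dict:
        key = cache_key(skill_version, audit, retest_sheet)
        if key not in self._store:
            self.api_calls += 1
            self._store[key] = call_model(audit, retest_sheet)
        return self._store[key]


def fake_model(audit: dict, retest_sheet: dict) -> dict:
    # Stand-in for the real Anthropic call.
    return {"matched_assets": len(audit["assets"])}


cache = ReplayCache()
first = cache.run("v1", {"assets": ["A", "B"], "job": "J1"}, {"rows": 2}, fake_model)
second = cache.run("v1", {"job": "J1", "assets": ["A", "B"]}, {"rows": 2}, fake_model)
print(cache.api_calls)  # 1
```

Sorting the keys before hashing is what makes the second call a cache hit even though its fields arrive in a different order.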

§ 06 Why one focused runner · deliberate simplicity

A single AI runner — agent-svc-skilled — is the only path real audit traffic crosses. One service to operate, one to deploy, one to understand. The Anthropic Platform's Skills API gives us hosted prompts that scale without per-call bloat, and the skill itself lives in a separate git repo so a power user can edit it without touching the runner.

What this gives us	How
One thing to operate (no path-selection logic, no fallback wiring)	n8n calls one URL. One Container App. One image to rebuild. The audit trail in `reasoning_calls` has one shape.
Skills scale cheaply (progressive disclosure on the Platform)	As the skill library grows (RAMS, sanity, etc.), only each skill's metadata is loaded per call, so token cost stays roughly flat.
Power-user edits, safely (skill source in a separate git repo)	Editing the prompt is a PR against `altus-safety-skills`. CI runs the 6-fixture eval against real Anthropic before merge.
Provider exit is not free, but possible (if the Anthropic relationship ever sours)	The skill body and schema are in git, not in Anthropic. A future swap to a Messages-API-only path (or another provider) is a multi-day rebuild, not a multi-week one.

§ 07 Three design pillars · non-negotiables

Every decision in this architecture is in service of three commitments. If a change would weaken one, it's the wrong change.

i.

Low-code visibility

An operations lead with no Python should be able to read every workflow in n8n, every skill in the git repo, every Word template — and edit them with help from any competent low-code developer. Hand-over readiness is a first-class requirement, not an afterthought.

ii.

AI in the right places only

The skills system is reserved for tasks that need reasoning: cross-referencing, classification, narrative drafting. RAMS production, reminder emails, deal-to-job sync — all mechanical, no AI. We are not paying inference cost where a template merge would do.

iii.

Safe by design

Every AI output is captured in an audit trail. Every client document passes a coordinator before it's sent. Cost is capped daily, by code, with a circuit breaker. The fallback path means a vendor outage degrades to slower service, not silently to bad output.

§ 08 Entry & exit points · where work begins & ends

There are very few. That's a feature — it means the perimeter is small, security has fewer surfaces to defend, and a stakeholder asking "where does an audit go in?" has a one-sentence answer.

In

Entry · 1

iAuditor poll

Primary trigger. n8n polls the SafetyCulture API every 15 min for audits in the signed-off state.

→

Entry · 2

SharePoint drop-folder

Fallback. Admin manually exports the audit PDF; fallback-svc watches the folder.

→

Entry · 3 (future)

Webhook or approval-app upload

Open question — choice depends on where audits live operationally. 30-min scoping conversation, not implementation.

Out

Exit · 1

DOCX → SharePoint

Two Word documents per audit, dropped into /Clients/<Client>/<Year>/<Job>/.
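
The folder convention above is simple enough to show as a small helper. This is a sketch only: the function name `client_folder`, the validation rule, and the sample client and job values are assumptions for illustration, and the real upload goes through Microsoft Graph rather than a local path.

```python
from pathlib import PurePosixPath


def client_folder(client: str, year: int, job: str) -> str:
    """Build the /Clients/<Client>/<Year>/<Job>/ folder path for one audit."""
    for part in (client, job):
        # Reject empty names and path separators so a value cannot escape its level.
        if not part or "/" in part or part in {".", ".."}:
            raise ValueError(f"invalid folder name: {part!r}")
    return str(PurePosixPath("/Clients") / client / str(year) / job) + "/"


print(client_folder("Example Client", 2026, "JOB-0042"))
# /Clients/Example Client/2026/JOB-0042/
```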

→

Exit · 2

Approval list row

A coordinator sees a new row, opens the docs, edits, ticks approved.

→

Exit · 3

Coordinator sends to client

Manual step today. The coordinator is the final filter. Nothing leaves Altus without a name on it.

§ 09 Total oversight · where to look

If a stakeholder asks "is the system working?" or "what did Claude actually do for that audit?" — here are the seven places to look. Each tells you something different; together they cover everything.

Surface	What it tells you	Audience
n8n console: low-code workflow UI	Every workflow execution, every node's inputs & outputs, success/failure per step. The first place to look on a "what just happened?" question.	Ops lead, dev partner
Azure Workbook: "Altus agent runs (stg)"	Cache hit rate, iteration distribution, runaway-loop detection, cost per day per path. Stage 3.5 deliverable, deployed via Bicep.	Dev partner, finance
SharePoint approval list: "Altus Audit Approvals"	Every audit waiting for a coordinator. Confidence score, document links, approver, approval date.	Coordinators, ops lead
approval-app / obs route: read-only API view	Every Claude call: tokens, cost, skill version, iteration count, container, stop reason. Audit trail for the safety regulator angle.	Dev partner, compliance
Teams alerts channel: #altus-ops-alerts	Heartbeat-driven alerts on iAuditor, Anthropic, Graph availability. Three-tier severity (info, warning, action).	Everyone
Anthropic console: platform.claude.com	Workspace billing dashboard, Skills versioning UI, rate-limit status.	Dev partner
GitHub repos: altus + altus-safety-skills	Every code & skill change as a reviewable PR. CI runs the eval against real Anthropic on every skill PR.	Dev partner, ops lead

§ 10 Power-user handover · what an ops lead inherits

The system is built so that Altus's operations lead — not a software engineer — can keep it running, evolve the rules, and bring in a generic low-code developer if a bigger change is needed.

Day-1 — they can

Operate without dev help

Open n8n, watch a workflow execute, re-run a failed step
Tick approval on a SharePoint list row, send DOCX to a client
Read the runbook + watch the Loom walkthrough
Triage a Teams alert (which integration is down? when was the last green heartbeat?)

Day-30 — they can

Evolve the rules

Edit a Word template — new column, different heading — via git PR
Tweak a skill prompt (open SKILL.md in any editor; CI guards it)
Change a workflow's schedule, add a Slack notification, add a filter
Read the audit trail to explain a model decision

When bigger change needed

Bring in any low-code dev

The architecture is small enough that any competent low-code developer can pick it up. Skill markdown, n8n workflows, Word templates: all standard formats. Python services are 800 to 1,500 lines of code each.

The 30-day snagging window covers anything that breaks before the ops lead is comfortable.

§ 11 Extensibility · what changes cost what

The cheap changes are deliberate. Anything Altus is likely to want next quarter is a markdown edit, a Word template change, or an n8n workflow add. The expensive changes are the rare ones.

Change	Where it happens	Effort
Tweak the AI prompt (e.g. soften narrative tone)	Edit `SKILL.md` in git, open PR, CI runs 6-fixture eval. Auto-syncs to Anthropic Platform on merge.	15 min
Change a Word template (e.g. new client logo, new column)	Edit `.docx` in git, sample-render attached to the PR comment for review.	30 min
Add a workflow (e.g. weekly summary email)	Build in n8n console: visual drag & drop, no code.	1 hr
Add a new skill (e.g. RAMS-tone-checker)	Create new directory in skills repo with `SKILL.md` + 6 eval fixtures. CI uploads, runners discover it via skill ID.	1 day
Swap inference provider (e.g. to Bedrock / OpenAI)	Fork `agent-svc-skilled`, replace the gateway, inline the skill body from git. Same `/run` contract for n8n.	3 days
Change the cloud (e.g. Azure → AWS)	All Bicep would become Terraform/CDK; Container Apps become ECS/Fargate. Code unchanged. Architecture mirrors what Softkrtl already runs on AWS.	~2 wks

§ 12 Risks · eyes open

A short, honest list. Most are time-bounded or covered by the backup path. The two rated Watch (beta-API drift and coordinator-edit rate) are the genuine fronts to watch.

Risk	What it means	Severity	Mitigation
Beta-API drift: Anthropic Skills + Sessions still beta	The hosted-skills feature is in beta. Anthropic could change the request shape at short notice.	Watch	Pinned SDK versions, release-note subscription. Skill body is in git and can be swapped to an inlined Messages-API path in a few days if there is an outage.
Single environment: staging IS production for PoC	No separate `altus-prod-rg`. A breaking deploy in staging affects real audits once onboarded.	Known	Deploy script digest-pins every image; manual gate; rollback means re-activating the previous revision.
Vendor coupling: Anthropic Platform for skills hosting	Skills hosting is only available from Anthropic. A provider change would mean rebuilding the runner.	Hedged	Skill source of truth is git, not Anthropic. Rebuilding as a Messages-API-only runner takes days, not weeks.
iAuditor API access: Premium tier may be required	If API access proves harder than expected, real audits cannot trigger automatically.	Open	SharePoint drop-folder fallback already wired; the manual entry path runs the same pipeline end to end.
Trial-clock expiries: SafetyCulture 2026-05-31, M365 2026-06-16	Vendor trials end soon; renew-or-cancel decisions required.	Calendar	Already flagged in the runbook; commercial conversation before each date.
Coordinator-edit rate: how often a human rewrites the AI	If the rate stays above 20% after the shadow period, model output isn't good enough and trust erodes.	Watch	Captured in `reasoning_calls`; eval corpus grows from every disagreement; human-in-loop always present.
ZDR not available on Anthropic Skills	Hosted skills are not Zero Data Retention (ZDR) eligible. Standard retention applies.	Accepted	Explicit user decision; revisit only if compliance posture changes.
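
To make the coordinator-edit-rate risk concrete, the arithmetic behind the 20% line can be sketched as below. The `edited` flag and the sample rows are hypothetical: the table says the rate is captured in `reasoning_calls` but does not say how, so this only shows the calculation.

```python
EDIT_RATE_THRESHOLD = 0.20  # the 20% line from the risk table


def edit_rate(approvals: list[dict]) -> float:
    """Share of approved drafts that a coordinator rewrote before approving."""
    if not approvals:
        return 0.0
    edited = sum(1 for row in approvals if row["edited"])
    return edited / len(approvals)


# Hypothetical sample: ten approvals, three of which were edited.
sample = [{"audit": f"A{i}", "edited": i < 3} for i in range(10)]
rate = edit_rate(sample)
print(f"{rate:.0%}", rate > EDIT_RATE_THRESHOLD)  # 30% True
```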